Identification of diagnostic candidates in Mendelian disorders using an RNA sequencing-centric approach

Background RNA sequencing (RNA-seq) is increasingly being used as a complementary tool to DNA sequencing in diagnostics where DNA analysis has been uninformative. RNA-seq enables the identification of aberrant splicing and aberrant gene expression, improving the interpretation of variants of unknown significance (VUSs), and provides the opportunity to scan the transcriptome for aberrant splicing and expression in relevant genes that may be the cause of a patient’s phenotype. This work aims to investigate the feasibility of generating new diagnostic candidates in patients without a previously reported VUS using an RNA-seq-centric approach. Methods We systematically assessed the transcriptomic profiles of 86 patients with suspected Mendelian disorders, 38 of whom had no candidate sequence variant, using RNA from blood samples. Each VUS was visually inspected to search for splicing abnormalities. Once aberrant splicing was identified in cases with VUS, multiple open-source alternative splicing tools were used to investigate if they would identify what was observed in IGV. Expression outliers were detected using OUTRIDER. Diagnoses in cases without a VUS were explored using two separate strategies. Results RNA-seq allowed us to assess 71% of VUSs, detecting aberrant splicing in 14/48 patients with a VUS. We identified four new diagnoses by detecting novel aberrant splicing events in patients with no candidate sequence variants from prior DNA testing (n = 32) or where the candidate VUS did not affect splicing (n = 23). An additional diagnosis was made through the detection of skewed X-inactivation. Conclusion This work demonstrates the utility of an RNA-centric approach in identifying novel diagnoses in patients without candidate VUSs. It underscores the utility of blood-based RNA analysis in improving diagnostic yields and highlights optimal approaches for such analyses. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-024-01381-w.


Background
With the advancement of next-generation sequencing, vast amounts of DNA sequencing data are continually generated to aid in the diagnosis and treatment of rare diseases.However, our ability to interpret genomic data has not grown at the same rate.The diagnostic yield of whole exome sequencing (WES) and whole genome sequencing (WGS) alone remains relatively low, leaving significant room for improvement [1][2][3][4].For instance, within the Genomics England 100,000 Genomes Project, the average diagnostic yield using WGS is around 25% [1].More broadly, diagnostic yield estimates for WGS range from 19.1 to 68.3%, while estimates for WES vary from 6.7 to 72.2% [3].RNA sequencing (RNA-seq) is now being used as a complementary tool to DNA sequencing for diagnostic genetic testing in rare diseases where DNA analysis alone has failed to identify a clear diagnosis [5][6][7][8][9][10][11][12][13][14][15].While some studies have focused on specific disorder types, such as mitochondrial disease [6,13], muscle disorders [5] and neurodevelopmental disorders [14], others have looked at heterogeneous disease populations [8,10,11,15,16].Unlike DNA sequencing, high-throughput RNA-seq is a qualitative and quantitative approach which allows the identification of aberrant splicing (AS), aberrant gene expression and monoallelic expression, allowing improved interpretation of variants of unknown significance (VUSs).An important benefit of the transcriptomic approach as compared to targeted reverse transcription PCR (RT-PCR) is that it is agnostic to the resulting abnormally spliced transcript, whereas RT-PCR must rely on targeted primer designs that are intrinsically limited by factors such as known gene annotations, PCR amplicon lengths and expected aberrant splicing event.RNA-seq therefore provides the opportunity not only to look at the splicing effects of known VUSs but also to scan the transcriptome for abnormal splicing events and expression abnormalities in other relevant genes that may be the cause of a patient's phenotype.This in turn allows the identification of molecular diagnoses in patients in which standard genomic DNA testing has not identified any candidate.
In this study, we have used RNA-seq to (a) investigate the feasibility of generating new diagnostic candidates in a subset of patients with no previously reported candidate VUSs in clinically relevant genes and (b) assess the use of blood as the tissue of choice in the implementation of an RNA-seq clinical pipeline to improve diagnostic yield of patients with rare diseases.

Patient recruitment
Participants were enrolled on the University of Southampton's Splicing and Disease study with appropriate ethical approval (REC 11/SC/0269, IRAS 49685, ERGO 23056).This cohort was comprised of rare disease patients assessed by UK clinical genetics services in whom a candidate VUS may or may not have been identified through conventional DNA-based testing (n = 86).Within this cohort, 48 individuals had pre-existing candidate VUSs (n = 51 variants) that had been clinically reported within genes of potential clinical relevance.Eighteen of the 51 VUSs have been previously assessed by RT-PCR [12].Thirty-two cases had unknown molecular diagnoses with no previously reported candidate VUSs in clinically relevant genes.Individuals without a molecular diagnosis had a phenotype where a genetic cause was suspected, and previous genetic testing was negative.During the study, six cases received a diagnosis not related to aberrant splicing, four with array deletions and two with indel mutations in the PURA gene.We have categorised these as cases with known genetic findings.Details about sample collection and RNA extraction are described in Additional file 1.

RNA sequencing
RNA samples were sequenced via Novogene (Hong Kong) in four separate batches (comprising 7, 16, 33 and 30 samples) using a total RNA-seq approach employing the NEBNext rRNA Depletion Kit and the NEBNext Ultra Directional RNA Library Prep Kit (New England Biolabs, MA).Samples in batches 1, 2 and 4 also had NEBNext Globin Depletion Kit applied, whereas those in batch 3 did not.The library was checked with Qubit and real-time PCR for quantification and bioanalyser for size distribution detection.On average 76 million 150 basepair paired-end reads were generated for each sample on a HiSeq 2000 instrument (Illumina, CA).FASTQ files underwent initial quality control filtering and adapter sequence removal by Novogene.Filtering included the removal of reads containing N > 10% (N: bases that cannot be determined) and reads with over 50% of low-quality bases (Qscore ≤ 5).Subsequent alignment was performed to the human genome reference (GRCh38) with annotations from GENCODE [17] release 38 using STAR aligner [18] v2.6.1c with optimised parameters via the University of Southampton's IRIDIS5 high-performance computing clusters.Alignment scripts can be found on GitHub (https:// github.com/ caroj oquen do/ RNA_ splic ing_ and_ disea se).
Where possible RT-PCR and Sanger sequencing were used to validate events (see Additional file 1).
The MRSD web portal (https:// mcgm-mrsd.github.io/) was used to predict the minimum number of sequencing reads required from RNA-seq experiments to confidently determine aberrant splicing events for a gene of interest [19].Default values for confidence level (95%) and splice junction proportion (75%) were used.For coverage, a minimum of five reads were used (n = 5).The online SpliceAI server (https:// splic eailo okup.broad insti tute.org/) was used to predict the splicing effect of all variants [20].

Assessment of aberrant splicing in cases with VUS and known molecular diagnosis
Due to the nature of the cohort, the assessment of aberrant splicing and expression was done in stages.Cases with VUSs and known events were assessed first followed by an assessment of cases with no VUS (Fig. 1).
To determine the functional consequence at a transcript level for each VUS, RNA-seq data was loaded into the Integrative Genomics Viewer (IGV) [21] and each variant was visually inspected to search for splicing abnormalities compared against two random samples within the cohort.To rule out mapping errors, new junctions required at least five supporting reads and ≤ 2 reads across any other sample within the cohort.If there were no splicing abnormalities in the exons and introns flanking the variant, it was determined that there were no splicing abnormalities resulting from the variant.Splicing abnormalities were classed as exon skipping, inclusion of cryptic exon, intron retention, alternative 5′ (donor) splice site and alternative 3′ (acceptor) splice site.The command-line tool ggsashimi [22] was used to create final sashimi plots to visualise junctions.RT-PCR was carried out in parallel to assess VUSs where possible.In some cases, RT-PCR was not carried out due to technical limitations (i.e.exon was too big, TPM = 0, single exon gene).
Fig. 1 Overview of methods.1.Data generation.This step was generally the same for all samples (n = 86).The only difference came in the library preparation stage where samples in batches 1, 2 and 4 also had globin depletion, whereas samples in batch 3 did not.2. Cases with a VUS were assessed first.Each variant of unknown significance (VUS) was visually inspected to search for splicing abnormalities in the Integrative Genomics Viewer (IGV).RT-PCR and Sanger sequencing were carried out in parallel for additional validation.3. Cases without a clinically relevant candidate variant were investigated last.This also included cases for which the original candidate VUS had not been found to alter splicing.The filtering strategy was determined based on observations across cases with a VUS.Results were visualised in IGV, and new diagnostic candidates were validated with RT-PCR and Sanger sequencing Additional details on RT-PCR results can be found in Additional file 2: Table S1.
Once aberrant splicing events had been ascertained in cases that had a VUS or known molecular diagnosis, we used this data to identify open-source tools best placed to identify potential aberrant splicing in cases for which there was no candidate variant.FRASER2 [23,24], rMATS-turbo v4.1.2[25], MAJIQ v2.4 [26] and LeafCut-terMD v0.2.9 [27] were used to detect aberrant splicing across all samples.The tools chosen are some of the most commonly used for splicing analyses, where rMATSturbo and MAJIQ are events-based methods while Leaf-CutterMD and FRASER2 are outlier approaches.We decided to use tools with two different underlying methodologies as there is still no gold standard for identifying splicing events in this type of cohort.For all tools except FRASER2, each sample was compared against other samples within the same batch except samples in batches 1 and 2 which were combined to increase power.rMATSturbo was run with additional parameters -novelSS to enable detection of novel splice sites, as well as -allowclipping to allow alignments with soft or hard clipping to be used.MAJIQ modules build and DeltaPSI were run with default parameters using the GENCODE v38 annotation gff3 files.The DeltaPSI results were then input into the MAJIQ voila module which provides a tab-delimited text file to allow parsing of the MAJIQ result and filters out local splice variations (LSVs) with no junctions predicted to change over a certain value.Default parameters for the voila module were used.LeafCutterMD and FRASER2 were also run with default parameters and results were annotated with gene symbols to extract genes and loci of interest for each sample.Once samples had been run through all the tools, events within the gene of interest were extracted to determine if the tools had been able to pick up what had been seen in IGV.This allowed us to check concordance between the tools and identify thresholds that could be used later when looking for events in cases without a VUS.

Assessment of aberrant splicing in patients without candidate VUSs
After cases with VUSs and known events were assessed, we investigated those without a clinically relevant candidate variant.This also included cases for which the original candidate VUS had not been found to alter splicing.The aligned BAM files were run through rMATS-turbo as well as through the GATK's Best Practices workflow for RNA-seq short variant discovery (https:// gatk.broad insti tute.org/ hc/ en-us/ artic les/ 36003 55311 92-RNA-seqshort-varia nt-disco very-SNPs-Indels-) to identify variants in the RNA-seq.
To reduce spurious calls, VCF files were run through GATK's VariantFiltration tool, keeping calls with a minimum quality score of 50.bcftools [29] was used to further filter variants excluding any variants with (a) less than eight reads covering the locus; (b) calls with genotype quality lower than 16; (c) calls with strand bias (FS metric) greater than 30; and (d) variants with a quality normalised by depth of at least two [30].After filtering, Ensembl's VEP v103 [31] was used to annotate the variants with additional information including but not limited to the nearest gene, variant consequence (e.g.missense, splice_region) and minor allele frequency (MAF).The SpliceAI VEP plugin was used to produce a score per variant (delta score) based on the likelihood of the variant impacting splicing.SpliceAI scores range from 0 to 1 with scores closer to 1 being more likely to affect splicing.VCF files were further filtered to keep variants that met all of the following conditions: (a) population frequency less than 0.01; (b) variants with a SpliceAI score ≥ 0.2; (c) variants found in protein-coding genes; and (d) single nucleotide variants.Indels were not included as the majority had very poor quality and inclusion of indels introduced a significant number of false positives.

Filtering strategies
To find diagnoses in cases without a VUS or gene of interest, two separate strategies were used.The first was a genotype-to-phenotype approach.The annotated and filtered VCF files were converted into a BED file format, adding 25 base pairs up and downstream of the variant [chromosome start(-25 bp) end(+ 25 bp) gene].rMATSturbo results were also converted into BED format.After sorting the BED files, the variant BED file was overlapped with the rMATS-turbo results BED using bedtools intersect v2.30 [32] keeping only overlapping features.Each alternative splicing event identified by rMATS-turbo that was found to overlap a variant was then inspected in IGV as previously described.
The second strategy involved using phenotype information available to filter results from splicing tools.To do this, appropriate panels from the UK Genomic Medicine Service (GMS) PanelApp [33] resource were applied to the splicing tools results and each AS event was also inspected in IGV.

Expression outlier detection
Salmon v1.6.0 [34] was used to quantify gene and transcript counts in mapping-based mode.Transcriptome indices for Salmon were generated using the GRCh38 genome and transcriptome reference from GENCODE release 38 (https:// combi ne-lab.github.io/ alevin-tutor ial/ 2019/ selec tive-align ment/).The R v 4.1.1[35] package tximport v1.22.0 [36] was used to collate and import raw read counts from all samples to be used as input into OUTRIDER v1.12.0 [37].The OUTRIDER function filterExpression was used to remove genes that had low Fragments Per Kilobase of transcript per Million mapped reads (FPKM) expression values followed by the OUT-RIDER function which ran the full OUTRIDER pipeline.

Summary of RNA sequencing data outputs
The mean number of sequencing reads per sample was 76.6 million (61.3-130.2 million) and on average 80% of reads were uniquely mapping (Additional file 3: Fig. S1 A and B).Mean number of splicing junctions identified across samples (Additional file 3: Fig. S1 C) was 398,718 [161,, 303].Spearman's rank correlation between observed median TPM values and median TPM values found in the Genotype-Tissue Expression (GTEx) portal [38] was 0.79 with a p value < 0.001 (Additional file 3: Fig. S1 D).When considering disease genes from the Online Mendelian Inheritance in Man (OMIM) database [39] and the UK Genomic Medicine Service's PanelApp resource, 67% (n = 11,128) and 75% (n = 2721) of genes were expressed in blood respectively (TPM > 1 in at least 4 samples).
As mentioned previously, globin depletion was not applied to one of the batches (batch 3).Analysis of the transcriptomic profiles showed this difference in targeting methodology as samples within batch 3 clustered together in principal component analysis as well as hierarchical clustering (Additional file 3: Fig. S1 E and F).Furthermore, median TPM values for the most abundant haemoglobin genes were in line with values reported in GTEx, which also did not utilise globin depletion.To avoid bias due to differences in sequencing methodology, samples were run in separate batches through the splicing tools.TPM values across genes which had VUSs within our cohort were also assessed (Additional file 3: Fig. S2), which showed that gene coverage in genes of interest was not negatively affected by the lack of globin depletion and, in some cases, the TPM values were higher in batch 3. A possible explanation for the slight increase of reads in the non-depleted batch is that the globin depletion step could also be reducing the reads for some non-haemoglobin genes as well [40].A comparison of the whole transcriptome between batch 3 and the other three batches (Additional file 3: Fig. S3) demonstrated that globin depletion increased coverage of lower expressed genes (median TPM 0-1).Overall, 865 genes had a TPM = 0 in batch 3 and TPM > 0 in batches 1, 2 and 4. Of these, 214 were OMIM genes, and 67 of those were also GMS PanelApp genes.

Splicing analysis in patients with a candidate VUS
We began by looking at the 48 cases which had a VUS to guide our analysis of those for which we had no candidate variant.This entailed the investigation of 51 VUSs across 36 different genes, as some cases had more than one VUS.Using default parameters, the MRSD tool predicted that we would only be able to assess 47% (n = 17) of the 36 genes in blood based on our mean number of sequencing reads.However, we found this tool to be overly conservative as we were able to assess 69% of genes (n = 25) using RNA-seq alone.Using the GTEx dataset as a reference, median TPM values across the 25 genes ranged from 0.89 to 73.24 with a mean and median of 19.43 and 11.16 respectively.The experimental median TPM values in our sequencing data ranged from 2.142 to 75.896 with a mean and median of 15.718 and 7.637 respectively.
Visual inspection of the RNA-seq BAM files in IGV allowed the detection of aberrant splicing in 14 cases with candidate VUSs (Table 1).Of these splice-altering VUSs, 13/14 were predicted to affect splicing according to SpliceAI (Δ score ≥ 0.2) and all 14 were validated via RT-PCR (RT-PCR results for 6 of the 14 variants have been previously reported [12]).The gene with the lowest median TPM value in GTEx for which we were able to detect aberrant splicing was KAT6B with a GTEx TPM of 0.890 and an experimental median TPM value of 9.79.Out of the 38 VUSs where aberrant splicing events were not detected, 15 variants could not be assessed as a result of low gene expression in blood (< 10 reads covering locus or normal junctions not observed in the sample and controls).For genes where RNA-seq was uninformative, median GTEx TPM values ranged from 0.00 to 4.91 with a mean and median of 0.49 and 0.08 respectively.RT-PCR was able to validate aberrant splicing in four additional cases with variants in TERT, PRG4 and TAOK1.Details of all assessed variants can be found in Additional file 2: Table S1.Cases (n = 23) which showed no aberrant splicing linked to a VUS were subsequently analysed as unknown cases.
The DKC1 variant (NM_001363.5:c.915+ 10G > A) was initially not found to affect splicing using RT-PCR in our previous publication [12].However, once the effect could  be seen using RNA-seq, it was possible to design targeted assays to confirm the findings on RT-PCR.Thus, initial detection was not achieved by RT-PCR but post RNAseq confirmation was possible.
We ran the RNA-seq data through different splicing tools to identify the best tool/s to use when assessing cases with no candidate variants.The splicing tools rMATS-turbo, MAJIQ, FRASER2 and LeafCutterMD each identified 11, 4, 4 and 2 of the AS events respectively.rMATS-turbo had the best sensitivity identifying 79% of the AS events, where 7 were events identified solely by this tool.Details on the number of events called by each tool can be found in Additional file 1.
The aberrant splicing effects of three variants, SF3B4 c.417C > T, UBR4 c.8488 + 3A > G and SMARCE1 c.8-4A > G, were consistently missed by all tools.The SF3B4 variant is predicted to affect splicing (SpliceAI ∆ score = 0.37); however, this gene has a high GC content and low mappability in large regions of its exons.Admittedly, only two reads mapped to the new junction and there were seven reads with the mutant allele that did not show aberrant splicing.Nonetheless, this event was validated via RT-PCR [results previously reported [12]] and has also been characterised using a β-globin hybrid minigene assay [41].The UBR4 variant is predicted to cause donor loss leading to intron retention (SpliceAI ∆ score = 0.26).In IGV 46 reads with the mutant allele and loss of the donor site were observed, but the event was not detected by any of the tools (Additional file 3: Fig. S4).Lastly, the SMARCE1 variant is predicted to cause an acceptor loss (SpliceAI ∆ score = 0.22).Like the SF3B4 variant, it also has few (n = 9) reads mapping to the new junction.All three events were validated via RT-PCR and in the case of the UBR4 intron retention with additional qPCR.
Using the evidence from RNA-seq and RT-PCR results, we were able to identify patterns in the data that would help us determine potential thresholds and limitations when assessing cases with no VUSs.This included the following: (1) SpliceAI predictions showed high concordance (sensitivity and specificity of 94% and 91% respectively) with the RNA-seq and RT-PCR results; (2) in general, for VUSs which caused aberrant splicing, the variant was present in the data and reads with the mutant allele did not show normal splicing; (3) the splicing tools were able to detect aberrant events with as low as 5 reads supporting a new junction; and (4) intron retention has a higher probability of being missed compared to other aberrant splicing events.

Splicing analysis in patients without candidate VUSs
The patient cohort without a candidate VUS was comprised of 32 individuals plus an additional 23 cases where the original candidate VUS had not been found to cause aberrant splicing.To identify new candidate events in these cases, we took a systematic approach to narrow down the results obtained from rMATS-turbo to a manageable number so these could be inspected manually in IGV.rMATS-turbo was the preferred tool as it had the highest sensitivity in identifying aberrant splicing events ascertained in the patients with a VUS.rMATS-turbo identified an average of 3578 (2370-115,522) significant events (FDR < 0.05) per sample with an inclusion level greater than 0.2 or less than − 0.2.
Our first approach used filtered VCF files obtained by the RNA-seq variant calling pipeline to extract AS events within 25 base pairs of a variant.This first filtering step reduced the mean number of aberrant splicing events identified ~ 300-fold to an average of 12 events per sample.Inspection of all events in IGV led to the identification of two new variants and associated aberrant splicing events.Here we highlight the cases and events which were identified with our RNA-centric approach.

Case 1-S075 (NARS1): identification of a splice-altering variant in a child with undiagnosed global developmental delay
rMATS-turbo identified two AS events within the NARS1 gene.The first event was an alternative donor site within exon 13 and the second was retention of intron 13 (Fig. 2).These events were linked to a heterozygous missense variant within exon 13 (NARS1 c.1460C > T) predicted to affect splicing (SpliceAI ∆ score = 0.93) by creating a new donor site.Deleterious variants in NARS1 are associated with neurodevelopmental disorder with microcephaly, impaired language and gait abnormalities, which would be consistent with the patient's phenotype [42,43].NARS1 pathogenicity is generally associated with biallelic deleterious variants; however, a recent study by Manole and colleagues has shown that de novo variants, including a recurrent nonsense variant at the end of the protein, can have a gain-of-function effect that alters normal protein function by interfering with the ATPbinding domain, crucial for enzymatic function [42].In this case, the intron retention is predicted to lead to an out-of-frame transcript, while the new donor site is predicted to lead to an in-frame deletion of 19 amino acids, both affecting the ATP-binding domain.

Case 2-S047 (ARFGEF1): inclusion of a cryptic exon in a child with undiagnosed developmental delay
rMATS-turbo identified an AS event within the ARF-GEF1 gene associated with a deep intronic variant (chr8:67274263A > T, NM_006421.5:c.1337+ 1713 T > A).This particular case was originally referred for analysis of a VUS (NM_138927.4:c.1160C> T), which after assessment in IGV was not observed to cause aberrant splicing.The ARFGEF1 variant is not predicted to affect splicing (SpliceAI ∆ score = 0.0); however, the sequencing data shows the creation of a new acceptor and donor site within intron 9 suggesting the inclusion of a cryptic exon (Fig. 3).The inclusion of this cryptic exon would result in an out-of-frame insertion of 186 nucleotides p.(Ser447PhefsTer19).
Our second filtering strategy was a phenotype-togenotype approach.Using the phenotype information available, results from the splicing tools were filtered using the appropriate Genomic Medicine Service (GMS) gene panels.This strategy led to the identification of one new candidate variant and associated aberrant splicing events.

Case 3-S076 (AP4E1): identification of cryptic exon inclusion and a second frameshift variant in a child with undiagnosed hypotonia
The hypotonic infant GMS panel (v18.1)was applied to rMATS-turbo results, which identified activation of a pseudoexon within intron 1 of AP4E1 involving the use of one alternative splice acceptor site and two alternative donor sites (Fig. 4).The two resulting transcripts are predicted to be out of frame, leading to an insertion of 142 and 38 nucleotides.These events were associated with an intronic variant (chr15:50911536G > A, NM_007347.5:c.151-542G> A) weakly predicted to affect splicing.SpliceAI delta scores were 0.11 and 0.09 for acceptor gain (− 32 bp) and donor gain (5 bp) respectively, but these were just below the 0.2 cut-off.However, analysis of the mutated sequence using ESEfinder predicts that the G > A base transition identified at this position may act as an exonic splicing enhancer through the creation of a binding site for splicing factors SC35 (SRSF2) and/or SRp40 (SRSF5) [44].This event was not picked up by the first method, as the variant was filtered out due to stringent quality thresholds [genotype quality (GQ) < 16; variant had a GQ of 6] required to manage noise when calling variants in RNA-seq data.Considering this gene has a biallelic mode of inheritance, we interrogated the rest of the gene for a second deleterious event and found a heterozygous single-nucleotide deletion in exon 6, NM_007347.5:c.567del,p.(Leu190TrpfsTer43), predicted to lead to an out-of-frame transcript (Fig. 4C).Further testing confirmed that the variants are biparental.Biallelic variants in AP4E1 are associated with spastic paraplegia type 51, which is consistent with the phenotype information we have available for the proband.

Gene expression outlier analysis with OUTRIDER
Once aberrant splicing was systematically assessed we investigated if (1) gene expression profiles would generate new diagnostic candidates and (2) whether expression outliers correlated with aberrant splicing.OUTRIDER was run across the entire cohort (n = 85) excluding sample S017 for which only 16% of reads were uniquely mapping.OUTRIDER identified 175 gene expression outliers (FDR < 0.05) across 39 samples.Of the 39 samples that had expression outliers, 16 were cases with a VUS, 18 were cases without a VUS and 5 were cases with known molecular diagnosis.Ten cases within our cohort had known chromosome microdeletions previously identified through microarray analysis (4 cases with known diagnosis and 6 with unknown diagnosis).In 5/10 of these cases, OUTRIDER identified genes with significantly lower expression which overlapped the deleted regions previously identified (Fig. 5).For the 16 cases which had a VUS, none of the outliers identified matched the gene in which the VUS was found.A deeper analysis of the results revealed that two samples with splice-altering variants, NM_001363.5(DKC1):c.915+ 10G > A and NM_022356.4(P3H1):c.1224-80G> A were also found to have the lowest gene expression in the whole cohort for DKC1 and P3H1 respectively (the expression rank of the gene was 1 for that sample) but was not significant after correction for multiple testing (Additional file 3: Fig. S5).The OUTRIDER gene p value before correction was 0.0002 (z-score = − 3.76) and 0.0009 (z-score = − 3.31) for DKC1 and P3H1 respectively.The adjusted p values were calculated a second time using only OMIM genes resulting in a modest increase of 26 significant events across the whole cohort; however, none of the expression outliers overlapped genes where VUSs had been previously identified.
While gene expression profiles did not generate new diagnostic candidates, OUTRIDER results did lead to further investigation of one of the analysed cases (S064) to confirm skewed X-inactivation.This individual was a female child with developmental delay and dysmorphic features.Chromosome microarray analysis had identified a de novo 10.2 Mb deletion of Xp22.33p22.2.However, this copy number variant was classified as a VUS owing to the child being female and the assumption that the X chromosome carrying the deletion would be preferentially inactivated.Standard DNA-based X-inactivation testing proved uninformative in this case, but further primer sets showed unilateral inactivation.Trio wholegenome sequencing was subsequently undertaken to further seek a potential cause for the patient's condition.No candidate variant was identified.However, it was possible to use parental SNP data to determine that the Xp deletion had occurred on the paternal X chromosome.Analysis of 9 additional heterozygous expressed SNPs in the patient's RNA-seq data from loci across both arms of the X chromosome also revealed monoallelic paternal expression of X-linked genes (Fig. 6).This therefore confirms complete skewing of X-inactivation towards the paternally inherited X-chromosome carrying the 10.2 Mb deletion.The cause of this extreme skewing currently remains unknown, as no candidates were found on the maternal X.However, the deletion is now thought to be causative for the patient's presenting phenotype, resulting in a functional nullisomy for all genes in the deletion region that are subject to X-inactivation.

Discussion
In this work, we have systematically assessed patients with no candidate VUSs in clinically relevant genes identifying three new diagnostic candidates and one additional diagnosis through analysis of expression profiles.We show that it is possible to make diagnoses using just RNA-seq in patients without a candidate VUS as well as classify VUSs using blood-based RNAseq and RT-PCR to uplift diagnostic yield in rare disease patients.This work displays the variety of events that can be picked up using RNA-seq (i.e.deep intronic and exonic variants, complex splicing abnormalities, deletions, skewed X-inactivation) highlighting the wide range of applications this technology can have in the clinical setting.

Splicing analysis in patients with VUSs
High-throughput blood-based RNA sequencing allowed us to evaluate the effect on splicing of 37/52 VUSs across 48 patients in clinically relevant genes.Thirty-eight percent of assessed VUSs (n = 14) caused aberrant splicing detectable by RNA-seq, helping to clarify variant interpretation and provide supporting evidence of pathogenicity [45].For the 15/52 VUSs in which splicing could not be assessed using RNA-seq, the corresponding gene was not expressed in blood.In comparison to RNA-seq, RT-PCR proved to be more sensitive allowing us to assess 41 VUSs and confirmed a further four likely pathogenic AS events, meaning that in total 35% (n = 18) of VUSs in this cohort were found to affect splicing (see Additional file 2: Table S1).These figures are in concordance with SpliceAI predictions which had a sensitivity and specificity of 94% and 91% respectively.The increase in sensitivity of RT-PCR can be attributed to the targeted approach allowing amplification of AS events in lowly expressed genes [46], as well as amplification of AS events with low inclusion levels (in some cases accounting for nonsense-mediated decay).For the nine cases where neither RNA-seq nor RT-PCR was able to resolve the VUS, such variants would need to be assessed with other tissue types or via alternative methods such as minigene analysis or potentially using animal models should the collection of an appropriate or adequately representative tissue not be feasible [46].
The MRSD tool was used to predict the minimum required sequencing depth for genes of interest and was found to be very conservative.From our empirical data, genes with whole-blood TPM values of 5 or above are likely to be assessable for splicing analysis using the RNAseq parameters employed in this study, while genes with TPM values down to 0.9 may be assessable by RT-PCR.Within GTEx, a TPM threshold of 5 would correspond to 1104/3113 (35%) of genes listed in the UK Genomic Medicine Service's PanelApp list of disease genes, while a threshold of 0.9 would include 1866/3113 (60%) (see Additional file 3: Fig. S6).Based on our analysis, we recommend RT-PCR be the first-choice test to assess VUSs in genes with low expression in blood such as BRCA1, BRCA2 and FBN1.In some instances, informative RT-PCR results can be obtained even in genes reported to have a TPM value of zero in GTEx [46].However, in most other cases, RNA-seq is likely to prove more advantageous as a first-line test.RNA-seq can identify splicing events with more granularity, particularly when new AS events entail only one or a few nucleotides.Furthermore, with RNA-seq, we can quantify splice isoforms, identify expression outliers and most importantly, investigate aberrant splicing events without prior knowledge of the causal variants.Case 2 (S047) highlights the utility of transcriptome-wide data.In this case, this patient was referred with a VUS that did not cause aberrant splicing; however, the comprehensive transcriptome analysis revealed a likely disruptive splicing event in a different gene.
While the VUSs in this cohort were enriched for variants affecting splicing, these were clinically identified VUSs for which clarification of pathogenicity was sought by clinicians, highlighting the need for this type of test to be integrated into clinical practice.Overall, we were able to assess 86% of VUSs (RNA-seq [n = 37] and RT-PCR [n = 7] combined), confirming the utility of blood as a suitable tissue for validating aberrant splicing in rare disease patients.

Identification of splicing events linked to VUSs by different splicing tools
While we did not set out to benchmark a comprehensive selection of splice junction detection tools, we did however want to establish if widely used tools could be used to detect aberrant splicing in rare disease patients as datasets used in previous benchmarking studies were not comparable to ours [47][48][49].Using the 14 cases with aberrant splicing linked to known VUSs, rMATS-turbo had the highest sensitivity followed by FRASER2, MAJIQ and then LeafCutterMD.FRASER2 and LeafCutterMD were developed for outlier splicing detection and therefore it was unexpected that LeafCutterMD had the lowest sensitivity.Additionally, the tools' performance was evaluated using default parameters.Therefore, there is room to optimise parameters for improved specificity and sensitivity.There were three events which were consistently missed by the splicing tools (caused by variants: SF3B4 c.417C > T, UBR4 c.8488 + 3A > G and SMARCE1 c.8-4A > G) and it is likely that the low number of reads covering the junctions within SF3B4 and SMARCE1 is the reason the splicing tools are not picking up these AS events.We suspect that the low number of reads supporting the new splicing events is due to nonsense-mediated decay or leaky splicing; however, in these two cases, there were not any coding SNPs to confirm NMD.Long-reads may help but determining the difference between partial/leaky splicing and NMD, or possible feedback effects on decreased transcription using short-read data will require new analytical methods.Furthermore, regarding intron retention in UBR4, the presence of intronic reads in controls makes it difficult for the tools to distinguish noise from real intron retention events (see Additional file 3: Fig. S4).

Splicing analysis in patients without VUSs
In patients with no prior VUS, the sheer number of significant events resulting from splicing tools creates a challenge when identifying new potential aberrant splicing events that could be linked to the patient's conditions.However, we were able to filter these down to a manageable number and identify new likely disruptive aberrant splicing events albeit with strict filtering criteria, rendering it likely genuine events were missed.
Out of a total of 55 cases without a previously identified VUS or with a VUS but with no aberrant splicing observed, our RNA-seq analysis identified three cases with relevant splicing alterations and one case with skewed X-inactivation, suggesting a potential diagnostic uplift rate of 7%.This is an important untapped group of variants with few established high-throughput methods of analysis in these types of cohorts [50].We thereby demonstrate that it is possible to identify new candidate diagnoses and splicing events in patients with no prior candidate sequence variants, although with a much lower yield than if a VUS were previously identified.Two of the four new diagnostics candidates were caused by deep intronic variants, regions of the genome often overlooked in genomic investigations and where it is difficult to predict functional effects.If prediction algorithms are to be used for prioritising variants, these may need to be tailored by genomic region, such as having a more permissive SpliceAI score threshold for deep intronic variants.This is demonstrated by the activation of a cryptic exon caused by a deep intronic variant in the ARFGEF1 gene whose SpliceAI delta score was just below the widely used 0.2 cut-off.
The low diagnostic yield in patients without a candidate VUS could be attributed to several factors: (1) the patient could have a variant affecting splicing in a gene that is not expressed in blood or there is a tissue-specific impact that is not present in blood; (2) the molecular cause of the disease does not affect splicing; (3) the tools are not able to identify these events with high confidence (e.g.aberrant isoform is undergoing nonsense-mediated decay or difficult to align); (4) performance of the tools is variable, some events are picked up better than others (e.g.exon skipping compared to intron retention); and (5) the variant could have been filtered out.
We were limited to using RNA for variant calling as there was no matching DNA sequencing available for most cases in this study.Consequently, the high number of false positive variant calls led to strict filtering criteria where only SNVs were inspected and thus events caused by indels will have been missed.RNA sequencing may not be ideal for variant calling as it generates high numbers of false positive calls compared to DNA sequencing due to both biological and technical differences.Nonetheless, the use of gene panels to restrict results from the splicing tool (rMATS-turbo) did recover a variant that had been excluded due to harsh filters.This approach does limit the analysis by restricting to known disease genes and relies on having robust phenotypic information.If matched whole genome or exome sequencing along with detailed phenotypic information were available, integration of this data would likely increase events identified in this patient subgroup and potentially increase diagnostic yield.

Gene expression analysis
OUTRIDER detected half of the known microdeletions in the cohort; however, it did not identify significant alterations in gene expression for those genes with variants causing aberrant splicing.There were only two instances where the VUS gene was ranked first (lowest expression for the whole cohort), and although neither passed the significance threshold, this suggests a likely decrease in normal transcripts.This finding also indicates that abnormal splicing is not necessarily associated with a significant reduction in gene expression, at least in blood, and over-reliance on such expression changes for identifying splicing abnormalities is unlikely to have reliable sensitivity.This is particularly interesting as we would expect many of the splicing abnormalities to shift the reading frame and therefore undergo nonsense-mediated decay (NMD) significantly decreasing the abundance of the transcript.This lack of change in the expression of genes with aberrant splicing could be a reflection of biological mechanisms indicating that the impact of NMD is not as effective at depleting aberrant transcripts or it could be due to technical factors such as limited sensitivity of the tools as mentioned previously; the impact of NMD is not very pronounced in blood-based RNA-seq and tissue-specific RNA-seq is required; and/or the targeting methodologies bias the type of transcripts and number of transcripts we observe.Furthermore, whole blood has been shown to have high variability in gene expression profiles particularly when compared to skin fibroblasts [16].Some studies suggest that fibroblast RNA enables the investigation of a more comprehensive set of genes than whole blood and that this is likely the better tissue for detecting clinically relevant differences in gene expression [10,13,19].While blood-based RNA analysis may not be optimal, it does offer several benefits over fibroblasts.It is more routinely sampled, less invasive to obtain and does not require cell culture before testing, meaning it is faster to obtain and analyse and has lower requirements in terms of specialised knowledge and facilities.

Therapeutic applications
Accurate diagnoses facilitate appropriate clinical management, accurate genetic counselling and informed reproductive decision-making, but in some cases, there would be the potential for bespoke RNA-targeted therapies to be designed to correct a given splicing abnormality and slow or halt the progression of an individual's disease.Cases such as that of the AP4E1 cryptic exon inclusion variant highlighted in this study may be especially suitable targets in this regard, on account of the gradual neurodegenerative nature of the associated condition and the known efficacy of other antisense oligonucleotide therapies delivered to the central nervous system such as nusinersen [51,52].Notwithstanding, the substantial challenges and barriers facing the development of such bespoke therapeutics, precedent does exist for n = 1 oligonucleotide therapies [53].The utility of RNAseq in being able to identify these types of variants means that an effective personalised medicine healthcare system will benefit from having access to RNA transcriptomics within the diagnostic clinical setting.

Conclusions
To our knowledge, this study is the first to incorporate variant calling data from RNA-seq to results from splicing tools to identify new diagnostic candidates in rare diseases.While the diagnostic uplift is modest in patients with no known candidate variants in clinically relevant genes, our analyses suggest at least one-third of patients with rare disorders could benefit from the increased diagnostic yield offered by RNA-seq by providing additional functional evidence for VUSs.When considering the analysis of RNA, RT-PCR should be the first-choice test to assess VUSs in genes with low expression, but highthroughput RNA sequencing is more advantageous as a first-line test.Overall, we were able to validate splicing abnormalities in 35% [18/51] of patients with a VUS and identified four new diagnoses by detecting novel AS and expression events in patients with no candidate sequence variants, giving an overall uplift in diagnostic yield of 7% [4/55] in this subset of patients.We believe that RNA-seq should be considered as a complementary tool in genetic testing to uplift diagnostic yield in cohorts of patients with rare disorders particularly when integrated with other omic data.

Fig. 2
Fig. 2 Alternative donor site and intron retention in the NARS1 gene.A Sashimi plot of the proband and two controls of the alternative donor and intron retention region in NARS1.For the proband only (red track), we observed an alternative donor site in exon 13 as well as intron 13 retention.Alignments in exons are represented as read densities (not normalised) and splice junction reads are represented as arcs connecting a pair of exons, where the number in the middle of the arc shows the number of reads aligning to the splice junction.B IGV screenshot of RNA coverage across exons 13 and 14.C Close-up of NARS1:c.1460C> T variant, a deep exonic variant predicted to affect splicing by creating a new donor site within exon 13

Fig. 3
Fig. 3 Activation of cryptic exon caused by intronic variant in the ARFGEF1 gene.A Sashimi plot of the proband and two controls of the ARFGEF1 region of interest.For the proband only (red track), two novel splice junctions can be seen suggesting the activation of a cryptic exon in intron 9. B IGV screenshot of RNA coverage across the region of interest.C Close-up of the chr8:67274263A > T variant

Fig. 4
Fig. 4 Activation of pseudoexon caused by an intronic variant in the AP4E1 gene.A Sashimi plot of the proband and two controls of the AP4E1 region of interest.For the proband only (red track), three novel splice junctions can be seen suggesting the activation of a pseudoexon in intron 1. B Close-up of NM_007347.5:c.151-542G> A variant in IGV.C Heterozygous single nucleotide deletion observed in exon 6 (chr15:50929032delT)

Fig. 5
Fig. 5 Manhattan plot of RNA aberrant expression detection using OUTRIDER.The Y-axis represents the z-score.Coloured labels (red) indicate significant events (adjusted p value ≤ 0.05).The figure displays OUTRIDER results for A sample S086-proband with an array 5q31 deletion, B sample S070-proband with 16p11.2deletion and C sample S064-proband with array Xp22 deletion

Table 1
Variants of unknown significance (VUSs) for which aberrant splicing (AS) was observed in IGV.SpliceAI Lookup scores are indicated stating if donor loss (DL), donor gain (DG), acceptor loss (AL) or acceptor gain (AG) is predicted.Events that were identified by the splicing tools but had an adjusted p value > 0.05 are denoted with an asterisk (*)